title: “chapter4.Rmd” output: html_document author: Faranak Halali, 24.11.2019

Week 4: Clustering and classification

This week’s task deals with clustering and classification methods. Boston dataset will be used.

Loading and exploring data

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Interpretation: Boston dataset contains 506 rows (observations) with 14 variables (columns). It is about housing values in suburbs of Boston, USA. All the variables are numeric. Example variables are: crim = per capite crime rates, indus = proportion of non-retail business acres per town, nox = nitrogen oxides concentration (parts per 10 million).

Graphical overview and summary of the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

Interpretation: Summary of the variables shows the min and max values of each variable plus the quartiles (Q1, Q2 = median, Q3) and mean. For example for the variable crim: the mean value for the per capite crime rate is 3.61. The plot matrix shows the relationships between pairs of variables. For example, there is a low-steep positive relationship between nox and age.

Standardize the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

Interpretation: Here, we standardized (z-normalized) the dataset because different variables in the dataset may have different dimensions and measured on different scales. Looking into the results of data normalization, all the variables have now a mean of zero.

Create a categorical variable of the crime rate

bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Interpretation: Based on the its quantiles, the original continuous crim variables was converted to a 4-level categorical variable called crime. The categories of this new variable are: low, med_low, med_high, high.

Drop the old crime variable and add the new categorical crime variable

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Divide the dataset to train and test sets

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Save the correct classes from test data

correct_classes <- test$crime

Remove the crime variable from test data

test <- dplyr::select(test, -crime)

Linear discriminant analysis

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2574257 0.2301980 0.2623762 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.95572404 -0.8717686 -0.077423115 -0.8659269  0.4243107
## med_low  -0.08320563 -0.2841228 -0.007331936 -0.5354109 -0.1371654
## med_high -0.38396622  0.1742905  0.193349455  0.4089740  0.2369176
## high     -0.48724019  1.0170298 -0.012331882  1.0594552 -0.3723712
##                 age        dis        rad        tax     ptratio
## low      -0.8673354  0.8359584 -0.6987343 -0.7469021 -0.40935287
## med_low  -0.3178123  0.3271295 -0.5434660 -0.4529740 -0.09804605
## med_high  0.4338051 -0.3700372 -0.4101075 -0.2941221 -0.27200081
## high      0.8179427 -0.8481723  1.6390172  1.5146914  0.78181164
##               black       lstat       medv
## low       0.3795874 -0.77254980  0.5165063
## med_low   0.3163503 -0.08680645 -0.0202947
## med_high  0.1445800 -0.04448735  0.2382108
## high     -0.8182524  0.91972122 -0.7192353
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.11584078  0.72013904 -1.0322672
## indus    0.01601055 -0.23640390  0.1711228
## chas    -0.09513173  0.01919263  0.1404721
## nox      0.44781528 -0.80406056 -1.2890937
## rm      -0.10503721 -0.18839993 -0.1001859
## age      0.20836547 -0.38778595 -0.2733285
## dis     -0.07954802 -0.40979808  0.1570075
## rad      3.44394872  1.04639786 -0.3080947
## tax      0.02133046 -0.19994256  0.8278850
## ptratio  0.12992525  0.02247724 -0.3680662
## black   -0.13213314  0.02903654  0.1365207
## lstat    0.21135334 -0.19235218  0.5662801
## medv     0.20352002 -0.43355471 -0.1759290
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9571 0.0307 0.0122

Interpretation: out of 404 observations (5060.80), 100 belong to low crime category, 99 to med_low, 105 to med_high, and 100 to high category. Next, we see the means of scaled variables in each of the 4 crime categories. Coefficients of linear discriminants show the linear combination of predictor variables that are used to form the LDA decision rule. for example, LD1 = 0.132zn + 0.039indus - 0.069chas,…… . Proportion of trace is the percentage separation in crime classes achieved by each discriminant function. In this case, LD1 has the highest separation percentage.

Plot the LDA results

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Interpretation: I can say that the combination of ‘rad & zen & nox’ variables are the most influential separators of the crime categories.

Predict LDA

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       8        1    0
##   med_low    2      18        2    0
##   med_high   1      13       17    2
##   high       0       0        0   21

Interpretation: LDA estimates the probability of a new set of inputs belonging to every class. The output class is the one that has the highest probability. In the test data, 32 observations are in the low crime category and LDA predicts 16 of them to be in low and the other 16 to be in med_low crime category (50% correct prediction). Out of 24 observations in the med_low crime category the LDA model predicts 20 in med_low and 4 in low category (83% correct prediction). Out of 24 med_high observations, the model predicts 12 in med_low and 12 in med_high category (50% correct prediction). Of 21 high crime observations the model predicts all to belong to high category (100% correct prediction). So, the highest correct predictions belong to the 2 ends of the crime levels, i.e., low and high.

Reload and standardize the Boston dataset

library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

Calculate the Euclidian and Manhattan distances between the observations

dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(boston_scaled, method = 'manhattan')
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Interpretation: I calculated the similarity (distance) between the observations by two methods: Euclidian distances (EuD) and Manhattan distances (MD). EuD is the length of the line segment connecting the observations. The median and mean EuD are 5.13 and 5.15, respectively. Manhattan method calculates the distances between points with a different approach using only the absolute (positive) values of the points. In this dataset, the median and mean MD are 13.7 and 14.7, respectively.

K-means clustering and optimal number of clusters

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

Visulaize the K-means clustering

library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

Interpretation: I used the within cluster sum of squares (WCSS) to determine the optimal number of clusters. In the figure we are looking for an ‘elbow’ which demarks significant drop in rate of increase in WCSS. K=2 seems a good choice here.

Run K-means clustering with K=2 and its visulaization

km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Interpretation: Based on the figure, the variable ‘tax’in combination with most of the other variables distinguish the two clusters pretty well. Also, ’nox’&‘dis’ and ‘dis’&‘age’ distinguish the clusters better than the other variable combinations.

Bonus question

library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
km1 <- kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km1$cluster)

boston_scaled$k_cluster <- as.factor(km1$cluster)
lda.fit.1 <- lda(k_cluster ~ ., data = boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(boston_scaled$k_cluster)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit.1, myscale = 1)

Interpretation: I performed K-means clustering on the standardized dataset with defind number of clusters=4. 202 observations belonged to cluster 1, 123 to C2, 128 to C3 and 53 to C4. Cluster 4 includes much lower proportion of the observations than the other 3 clusters and this could indicate that maybe 4 clusters is not the perfect solution. In the biplot, I can say that the linear combination of ‘indus & rad & black & zn’ variables are the most influential separators of the clusters.

Super-bonus question

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Interpretation: The two 3-D plots are similar from the position of the data points. In the second plot, where I defined the color argument, 4 different colors distinguish the data points and indicates to which crime category (low, med_low, med_high, high) each data point belongs.